Results look great! I couldn't find how you run TransformerEngine; otherwise I could check this myself. I assume you've turned on their userbuffer overlapping as in https://github.com/NVIDIA/TransformerEngine/blob/66f9b3cbae214d521ac18883fe9a386b8893b179/examples/pytorch/comm_gemm_overlap/te_layer_with_overlap.py#L50?
I think it's correct, but I still need to confirm my results and double-check with the TE team that things are run the correct way. I used ddlb to run TE; see the reference here: https://github.com/samnordmann/ddlb/blob/main/ddlb/primitives/TPColumnwise/transformer_engine.py
Improves the printing of `HostIrContainer` by also printing the index computations that are not explicitly part of the `topLevelExprs`. Example from #5259:

```
%HostIrContainer { (T0_g___bfloat[ideviceIdx.x0{8}, iS1{128}, iS2{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), T1_g___bfloat[iS3{1024}, iS4{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), T2_g___bfloat[iS5{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})) -> (T3_g___bfloat[istreamIdx6{8}, iS7{128}, iS8{1024}, rS9{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})) :
  T4_g___bfloat[istreamIdx10{8}, iS11{128}, iS12{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = ALLOCATE(buffer=T4_g___bfloat[istreamIdx10{8}, iS11{128}, iS12{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), mem_type=global, size=1048576, zero_init=false, resets_to_zero=false)
  T3_g___bfloat[istreamIdx6{8}, iS7{128}, iS8{1024}, rS9{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = ALLOCATE(buffer=T3_g___bfloat[istreamIdx6{8}, iS7{128}, iS8{1024}, rS9{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), mem_type=global, size=1048576, zero_init=false, resets_to_zero=false)
  GetCurrentStream into Stream 0
  FOR streamIdx in istreamIdx10{8}:
    SetCurrentStream to Stream ( streamIdx % numberOfStreams )
    Synchronize Stream 0
  FOR streamIdx in istreamIdx10{8}:
    SetCurrentStream to Stream ( streamIdx % numberOfStreams )
    T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = HirAliasSelect( T4_g___bfloat[istreamIdx10{8}, iS11{128}, iS12{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = istreamIdx10{8}, index = i84 )
    IF Manual ( ( ( 8 + ( rank - streamIdx ) ) % 8 ) == rank ):
      T5_l___bfloat[iS13{128}, iS14{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = HirAliasSelect( T0_g___bfloat[ideviceIdx.x0{8}, iS1{128}, iS2{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = ideviceIdx.x0{8}, index = 0 )
      T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = Set( T5_l___bfloat[iS13{128}, iS14{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), cache_op=Streaming )
    ELSE:
      ShareMemHandles(P2PCommunication 37 (type=recv, buffer=T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=i84, backend=CUDA), P2PCommunication 38 (type=send, buffer=T0_g___bfloat[ideviceIdx.x0{8}, iS1{128}, iS2{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=i90, backend=CUDA))
      P2PCommunication 38 (type=send, buffer=T0_g___bfloat[ideviceIdx.x0{8}, iS1{128}, iS2{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=i90, backend=CUDA)
      P2PCommunication 37 (type=recv, buffer=T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=i84, backend=CUDA)
      Wait Communication 38
      Wait Communication 37
    T7_l___bfloat[iS17{128}, iS18{1024}, rS19{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = HirAliasSelect( T3_g___bfloat[istreamIdx6{8}, iS7{128}, iS8{1024}, rS9{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = istreamIdx6{8}, index = i84 )
    T7_l___bfloat[iS17{128}, iS18{1024}, rS19{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = linear(T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), T1_g___bfloat[iS3{1024}, iS4{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), T2_g___bfloat[iS5{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}))
  SetCurrentStream to Stream 0
  Synchronize Stream ( streamIdx % numberOfStreams )
} // %HostIrContainer

Index definitions:
  i111 = streamIdx % numberOfStreams;
  i90 = i88 % 8;
  i32 = i30 * 1024;
  i30 = 8 * 128;
  i86 = rank - streamIdx;
  i82 = rank + streamIdx;
  i74 = 8 * 128;
  i76 = i74 * 1024;
  i84 = i82 % 8;
  i88 = 8 + i86;
```
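The index definitions above determine the ring peers at each pipeline step. As an illustration (not part of the PR; `recv_peer` and `send_peer` are hypothetical names, with d = 8 devices as in the example), they evaluate as:

```python
# Evaluate the index definitions from the HostIrContainer dump, for d = 8 devices.
# i84 = (rank + streamIdx) % 8 is both the recv peer and the output-chunk index;
# i90 = (8 + (rank - streamIdx)) % 8 is the send peer.
d = 8

def recv_peer(rank, stream_idx):
    i82 = rank + stream_idx
    return i82 % d          # i84 in the dump

def send_peer(rank, stream_idx):
    i86 = rank - stream_idx
    i88 = d + i86
    return i88 % d          # i90 in the dump

# Sanity check: if rank r receives from p at step s, then p sends to r at step s,
# and at step 0 every rank is its own peer (the `IF Manual` branch above).
for s in range(d):
    for r in range(d):
        assert send_peer(recv_peer(r, s), s) == r
assert all(recv_peer(r, 0) == r for r in range(d))
```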
Adds a lowering path to generate a p2p ring pipeline backed by our recent CUDA IPC backend. The performance looks great and even beats Transformer Engine for large matrix sizes. For example, for TP columnwise (i.e., AG+Matmul) with m=32, k=16k, n=8k, the throughput (in TFLOPs) of the different implementations reads as follows:

- Fuser default, with NCCL backend: 560 TFLOPs. This has the same performance as a baseline PyTorch eager implementation.
- Fuser with p2p pipeline and CUDA IPC backend: 678 TFLOPs
- Transformer Engine: 660 TFLOPs

<img width="786" height="473" alt="Screenshot 2025-09-29 at 16 29 42" src="https://github.com/user-attachments/assets/0bf34178-ccef-4d4d-abcf-3f4aa3704f69" />

This was measured using [DDLB](https://github.com/samnordmann/ddlb) and [this Fuser branch](https://github.com/NVIDIA/Fuser/tree/lower_to_cuda_ipc_p2p_rebased), on a single 8*H100 DGX node.

This PR is dependent on:

- #4466. Without the Allocation Cache, a rank might change the allocated buffer across iterations. Besides being a performance issue, this can create a hang if the IPC cache is not hit uniformly across ranks.
A better long-term solution would be to use PyTorch's recent symmetric allocator.

- (for performance only) #5325

The test written in the PR expresses a matmul

```
C = matmul(A, B), where
- A [DIDx(d), M/d, K]
- B [K, N]
- C [Stream(d), M/d, N]
```

The generated host program is:

```
%HostIrContainer { (T0_g___bfloat[ideviceIdx.x0{8}, iS1{128}, iS2{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), T1_g___bfloat[iS3{1024}, iS4{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), T2_g___bfloat[iS5{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})) -> (T3_g___bfloat[istreamIdx6{8}, iS7{128}, iS8{1024}, rS9{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})) :
  T4_g___bfloat[istreamIdx10{8}, iS11{128}, iS12{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = ALLOCATE(buffer=T4_g___bfloat[istreamIdx10{8}, iS11{128}, iS12{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), mem_type=global, size=1048576, zero_init=false, resets_to_zero=false)
  T3_g___bfloat[istreamIdx6{8}, iS7{128}, iS8{1024}, rS9{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = ALLOCATE(buffer=T3_g___bfloat[istreamIdx6{8}, iS7{128}, iS8{1024}, rS9{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), mem_type=global, size=1048576, zero_init=false, resets_to_zero=false)
  GetCurrentStream into Stream 0
  FOR streamIdx in istreamIdx10{8}:
    SetCurrentStream to Stream ( streamIdx % numberOfStreams )
    Synchronize Stream 0
  FOR streamIdx in istreamIdx10{8}:
    SetCurrentStream to Stream ( streamIdx % numberOfStreams )
    T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = HirAliasSelect( T4_g___bfloat[istreamIdx10{8}, iS11{128}, iS12{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = istreamIdx10{8}, index = i84 )
    IF Manual ( ( ( 8 + ( rank - streamIdx ) ) % 8 ) == rank ):
      T5_l___bfloat[iS13{128}, iS14{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = HirAliasSelect( T0_g___bfloat[ideviceIdx.x0{8}, iS1{128}, iS2{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = ideviceIdx.x0{8}, index = 0 )
      T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = Set( T5_l___bfloat[iS13{128}, iS14{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), cache_op=Streaming )
    ELSE:
      ShareMemHandles(P2PCommunication 37 (type=recv, buffer=T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=i84, backend=CUDA), P2PCommunication 38 (type=send, buffer=T0_g___bfloat[ideviceIdx.x0{8}, iS1{128}, iS2{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=i90, backend=CUDA))
      P2PCommunication 38 (type=send, buffer=T0_g___bfloat[ideviceIdx.x0{8}, iS1{128}, iS2{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=i90, backend=CUDA)
      P2PCommunication 37 (type=recv, buffer=T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=i84, backend=CUDA)
      Wait Communication 38
      Wait Communication 37
    T7_l___bfloat[iS17{128}, iS18{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = HirAliasSelect( T4_g___bfloat[istreamIdx10{8}, iS11{128}, iS12{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = istreamIdx10{8}, index = i107 )
    T8_l___bfloat[iS19{128}, iS20{1024}, rS21{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = HirAliasSelect( T3_g___bfloat[istreamIdx6{8}, iS7{128}, iS8{1024}, rS9{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = istreamIdx6{8}, index = i107 )
    T8_l___bfloat[iS19{128}, iS20{1024}, rS21{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = linear(T7_l___bfloat[iS17{128}, iS18{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), T1_g___bfloat[iS3{1024}, iS4{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), T2_g___bfloat[iS5{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}))
  SetCurrentStream to Stream 0
  Synchronize Stream ( streamIdx % numberOfStreams )
} // %HostIrContainer
```
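To see why this schedule realizes the all-gather + matmul, here is a minimal pure-Python simulation of the generated program's communication pattern (a sketch, not the actual lowering code; `simulate` is a hypothetical helper, and d = 8 and the peer formulas are taken from the printed program): at step `streamIdx`, rank `r` either reuses its local shard (the `IF Manual` branch, which fires only at step 0) or receives shard `(r + streamIdx) % d` directly from the rank that owns it, then computes the matching output chunk.

```python
# Simulate the p2p schedule from the generated host program for d = 8 ranks.
# Each rank owns one shard of A; at step s it obtains shard (r + s) % d
# (locally when s == 0, otherwise from peer (r + s) % d, which is exactly
# the rank that owns that shard) and computes the matching output chunk.
d = 8

def simulate():
    # computed[r] collects the output-chunk indices rank r has produced.
    computed = [set() for _ in range(d)]
    for s in range(d):
        for r in range(d):
            chunk = (r + s) % d           # i84: recv peer and output index
            if (d + (r - s)) % d == r:    # the `IF Manual` branch: local shard
                assert s == 0 and chunk == r
            # otherwise the chunk arrives from rank `chunk`, its owner
            computed[r].add(chunk)
    return computed

# After d steps, every rank has computed all d output chunks,
# i.e. the full all-gather + matmul result.
assert all(chunks == set(range(d)) for chunks in simulate())
```

Note that each rank sends its own input shard (T0) at every step rather than relaying received data, so this is a ring-ordered schedule with direct sends rather than a store-and-forward ring.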